NOV 9 2019 OUTAGE REPORT

A DELAYED POST-MORTEM

Clockwork Sun

Posted by Shaun
Last Updated: 2022-12-28

On November 9th of 2019, there was a pretty complete outage of all services I provide via the Clockworksun Network. So, what happened?

What happened

At around 4AM Eastern time on November 9th, 2019, a stick of RAM in one of the main hypervisors for the network failed. Even better, a maintenance cycle was running at the time, so several virtual servers had maintenance processes resident in the failed stick. Those services immediately died, hung, or otherwise stopped working. One of them was a network attached storage (NAS) server that serves web content to the multiple web servers I run.

At 8AM I woke up, and at 9:30 I noticed that something was wrong. I investigated and immediately saw that the hypervisor thought it was missing a stick of RAM, and that it had noticed the deficit around 4 in the morning. I emailed the alerts list and began cleaning up: first restoring services that had been corrupted by the missing RAM, then shutting down the affected node and physically replacing the failed stick. Going from zero to having the hypervisor back up with its full complement of RAM, with services beginning to return to operation, took about an hour in all.

At this point, I noticed that the NAS had not come back up like the rest of the services. It turned out that it had been running a backup at the time of the failure, and that backup had also been corrupted by the failed RAM stick. I had another backup from the previous week, but I wanted to try recovering the server without losing a week's worth of work first. I investigated the error codes reported by the file-sharing services and eventually tracked the problem down to a corrupted configuration file. After restoring just that one file, the server returned to full operation, and the incident was resolved around 11AM after some additional testing to ensure everything was working as expected.
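The fix itself amounted to copying a known-good copy of that one file over the corrupted one, along the lines of the sketch below. The paths and file names here are hypothetical stand-ins for illustration, not the actual layout of the NAS:

    import shutil
    from pathlib import Path

    # Hypothetical stand-in paths; the real config file and backup mount differ.
    LIVE_CONF = Path("/etc/fileshare/fileshare.conf")
    BACKUP_CONF = Path("/mnt/backup/etc/fileshare/fileshare.conf")

    def restore_single_file(live, backup):
        """Set the corrupted file aside, then copy the known-good backup into place."""
        if not backup.is_file():
            raise FileNotFoundError("backup copy not found: %s" % backup)
        if live.exists():
            # Keep the corrupted copy around for later inspection.
            live.rename(live.with_name(live.name + ".corrupt"))
        shutil.copy2(backup, live)  # copy2 also preserves timestamps and mode bits

    if __name__ == "__main__":
        restore_single_file(LIVE_CONF, BACKUP_CONF)
        print("Restored %s; restart the file-sharing service to pick it up." % LIVE_CONF)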

Root cause

The root cause of the incident was the failed stick of RAM, which corrupted several processes during a maintenance window, including the backup I would normally have used to restore service. The secondary cause was the corrupted configuration file on the NAS, which was ultimately restored from the previous week's uncorrupted backup.

Measures taken

In order to prevent an incident like this from happening again, I have taken the following measures:

  • Enabled additional notifications and monitoring systems to check for hardware failures and for failures of maintenance jobs; a rough sketch of the kind of hardware check I mean follows this list. (Gmail had apparently been eating these automated reports as "spam" for quite some time.)
  • Stocked up on spare RAM sticks to be able to replace any additional failed hardware.
  • Began making plans to migrate the NAS to a different, more streamlined system, as the current server is relatively complex and difficult to troubleshoot.
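To illustrate the first measure above, here is a rough sketch of the sort of hardware check now running. The expected memory size, the tolerance, and the alert destination are assumptions for illustration only; the real monitoring reports through the alerts list rather than this exact script.

    import sys

    EXPECTED_TOTAL_KIB = 64 * 1024 * 1024   # hypothetical figure: 64 GiB installed
    TOLERANCE_KIB = 512 * 1024              # slack for memory the firmware/kernel reserves

    def read_memtotal_kib(path="/proc/meminfo"):
        """Return the MemTotal value (in KiB) as reported by the Linux kernel."""
        with open(path) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1])
        raise RuntimeError("MemTotal not found in " + path)

    def main():
        seen = read_memtotal_kib()
        if seen + TOLERANCE_KIB < EXPECTED_TOTAL_KIB:
            # In the real setup this would go to the alerts mailing list, not stderr.
            print("ALERT: kernel sees %d KiB of RAM, expected roughly %d KiB"
                  % (seen, EXPECTED_TOTAL_KIB), file=sys.stderr)
            return 1
        print("OK: %d KiB of RAM visible" % seen)
        return 0

    if __name__ == "__main__":
        sys.exit(main())

Run from cron every few minutes, a check like this would have flagged the missing stick hours before I woke up, instead of the failure sitting silently in a spam folder.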

If you have any other questions or concerns about the outage, feel free to email [email protected].